Skip to content

Conversation

@malberts
Copy link
Contributor

@malberts malberts commented Jun 1, 2025

Closes #39

This uses a library to do the HTML stripping.

It still runs into the context window limit: #30

Screenshot_20250601_172122

The wikitext there is actually present in the HTML: https://en.wikipedia.org/w/rest.php/v1/page/Earth/with_html. It did strip the HTML, although it looks like something went slightly wrong here:
Screenshot_20250601_172549

type: 'text',
text: `Text:\n${ stripHtml( result.html ).result }`
} );
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this should be its own function

results.push( newFunction() )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parts of this will get rewritten in #38 anyway.

@JeroenDeDauw
Copy link
Member

JeroenDeDauw commented Jun 1, 2025

Looks reasonable. I do wonder if info gets lost and if that is important. For instance, knowing that something is a list item in a bulleted list

@malberts
Copy link
Contributor Author

malberts commented Jun 2, 2025

I'm also not sure. At least right now we have source, HTML and plain text, so if your response looks wrong you can nudge it to use HTML.

I'm also slightly wondering about the utility of plaintext vs HTML when we still run into the context limit issue. The chunking workaround in #38 needs to be applied anyway. Although I guess having a smaller string is still better than a longer string.

The string-strip-html library supports not stripping some tags (and probably some other config I did not look at, including full control via callback), so we can tweak that. But then I wonder about the semantics of other tags, e.g. tables and emphasis. And if we then end up with partial HTML, do we still need plain text and full HTML? We might have to tweak some of the descriptions we provide in the tool definition, to default a generic "give me page content" to whichever is the most common.

@alistair3149
Copy link
Contributor

Do we need to strip the HTML at all knowing that the LLM can understand HTML?
If we are just looking for a generic give me page content tool, we can use a Markdown parser instead in the LLM will mostly process the Markdown syntax.

@malberts
Copy link
Contributor Author

malberts commented Jun 2, 2025

Stripping HTML was partially an attempt to reduce the size.

I don't know if there is a real requirement to render the page content in the LLM. It seems like more effort to convert HTML to markdown. And if the whole HTML fit in the context window, then I suspect you could just ask the LLM to render the HTML as markdown, or just render it in whatever way it is able to.

@alistair3149
Copy link
Contributor

alistair3149 commented Jun 2, 2025

The implementation looks good to me.

I am not sure if the plain text output is helpful since there are often templates that go into the HTML and wikitext, which can mess up the output. Some of the templates or other HTML elements can contain information important to the page.

Alternatively, it might be more useful for implement some kind of page summary extraction. Which is similar to the Page Content Service Summary endpoint on WMF wikis or the TextExtracts API

@malberts
Copy link
Contributor Author

malberts commented Jun 3, 2025

Getting a deterministic page summary might be useful (as opposed to asking the LLM to get the full text and summarize it). However, my understanding of TextExtracts is that it returns a snippet of the content, so it's not really a summarized version of the content if that snippet is not written as a summary. This might still be useful wrapped in a tool like get-page-text-extracts if you just need a part of the page.

But by itself, getting a summary/extract is not an alternative for when the full text is needed. The main question for this PR is whether there is a technical advantage in explicitly providing non-HTML content.

@JeroenDeDauw
Copy link
Member

Given that it is not clear if we need the plain text, we could keep the PR open for potential future continuation.

@malberts malberts marked this pull request as draft June 21, 2025 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add plaintext support to get-page

4 participants